Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid memory access when a super-time-stepping method is used #629

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

takasao
Copy link

@takasao takasao commented Oct 29, 2024

Using Valgrind and LLVM’s clang compiler with -fsanitize flag, I found invalid memory access when a super-time-stepping (STS) method was used. For example, resist.cpp demonstrates this problem. I have never recognized this problem before. I suspect that optimization options have some effects.

It was found that invalid memory access occurred when we referred to functions in time_integrator.cpp which have
stage_wghts from sts_task_list.cpp. For example,sts_task_list.cpp refers to
TaskStatus TimeIntegratorTaskList::DiffuseField
but stage_wghts is not prepared on the STS side, which causes the problem.

To avoid this problem, I have prepared new task functions that do not use stage_wghts for the STS methods. The following two files have been updated:

  • task_list.hpp
  • sts_task_list.cpp

I have copy-and-pasted relevant functions from time_integrator.cpp to sts_task_list.cpp and have removed stage_wghts from the functions. There will be a better way to avoid such redundant descriptions.

I didn't perform the regression tests that require fft, omp and cvode because my environment lacks these external libraries. But the code has passed the other tests.

Stopped referring to taks funcs in time_integrator.cpp which use "stage_wghts" from sts_task_list.cpp.
@buildbot-princeton
Copy link
Collaborator

Can one of the admins verify this patch?

@felker
Copy link
Contributor

felker commented Oct 29, 2024

ok to test

@felker
Copy link
Contributor

felker commented Oct 29, 2024

Thanks for finding this! I would like to make sure that there isn't a more elegant solution that can avoid all the code duplication.

Can you post the full Valgrind and sanitizer errors, and the compile + run commands you used to get them?

Maybe @pdmullen could help me understand when he gets a chance: I guess I don't understand how the SuperTimeStepTaskList class can even call the function pointers to the public member functions of TimeIntegratorTaskList, when there is no direct inheritance relationship:

  } else if (id == PHY_BVAL) {
    task_list_[ntasks].TaskFunc=
        static_cast<TaskStatus (TaskList::*)(MeshBlock*,int)>
        (&SuperTimeStepTaskList::PhysicalBoundary_STS);
    task_list_[ntasks].lb_time = true;
  } else if (id == DIFFUSE_HYD) {
    task_list_[ntasks].TaskFunc=
        static_cast<TaskStatus (TaskList::*)(MeshBlock*,int)>
        (&TimeIntegratorTaskList::DiffuseHydro);
    task_list_[ntasks].lb_time = true;
  } else if (id == DIFFUSE_FLD) {
    task_list_[ntasks].TaskFunc=
        static_cast<TaskStatus (TaskList::*)(MeshBlock*,int)>
        (&TimeIntegratorTaskList::DiffuseField);
    task_list_[ntasks].lb_time = true;

This is how all TaskFuncs are called:

ret = (this->*task_list_[i].TaskFunc)(pmb, stage);

So when pststlist->DoTaskListOneStage(pmesh, stage); is called in the main timestepping function, it makes sense that this->PhysicalBoundary_STS(pmb, stage) is defined, but this->DiffuseField(pmb, stage) should be undefined, no?

Would making a copy of stage_wghts[] in the SuperTimeStepTaskList class constructor fix this? Since it has a pointer to the defined TimeIntegratorTaskList object, this is simple (but we would need to change the visibility of stage_wghts[] from private to public or make a Friend class relationship):

SuperTimeStepTaskList::SuperTimeStepTaskList(
    ParameterInput *pin, Mesh *pm, TimeIntegratorTaskList *ptlist) :
    sts_max_dt_ratio(pin->GetOrAddReal("time", "sts_max_dt_ratio", -1.0)),
    ptlist_(ptlist) {

But then I worry that dereferencing the function pointers still wouldn't refer to this object. It seems that this pointer is basically unused in the STS task list class, except for making a copy of the nstages variable:

// currently intiialized but unused. May use it for direct calls to TimeIntegrator fns:
TimeIntegratorTaskList *ptlist_;

A more sure-fire way to fix this would be to override DoAllAvailableTasks() with an STS-specific version that can dereference this or ptlist_ depending on the task ID

@takasao
Copy link
Author

takasao commented Oct 29, 2024

Thank you for your investigation. Here is the setting and the result on Mac (Sonoma 14.7, M1 tip, Apple clang version 16.0.0 (clang-1600.0.26.3))
I configured the settings as follows:
python configure.py --prob resist -b -sts
and run by the following command:
path/to/bin/athena -i athinput.txt

Makefile.txt
log_clang.txt
athinput.txt

We have performed the test by valgrind on a different machine. I will provide the information about it later.

@takasao
Copy link
Author

takasao commented Oct 29, 2024

Regarding the check by Valgrind:
python configure.py --prob resist -b -sts
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --log-file=memory_log.txt /path/to/bin/athena -i athinput.txt -d output >> log_run
memory_log.txt

@felker
Copy link
Contributor

felker commented Nov 2, 2024

Thinking about this a bit more...

The original TaskList and TimeIntegratorTaskList class and inheritance design pre-dates me; maybe @tomidakn remembers the details of the original choices. The use of static_cast to upcast method pointers from TimeIntegratorTaskList methods to TaskList methods is totally fine as written:

static_cast<TaskStatus (TaskList::*)(MeshBlock*,int)>
(&TimeIntegratorTaskList::ReceiveEMFShear);

as long as the function pointers are only ever used on TimeIntegratorTaskList class objects. Even downcasting is fine and wouldnt need static_casts (typically less useful, however and Athena++ doesnt use it in this context):

The addition of STS in #176 was also safe by the C++ standard, since all the task class method pointers were to SuperTimeStepTaskList methods:

case (DIFFUSE_FLD):
task_list_[ntasks].TaskFunc=

The first time this inheritance design was extended in an unsafe way was in #257, a major 386-commit refactor, in which I removed the STS task functions that were identical to the original task functions, and I claimed in the writeup to change:

SuperTimeStepTaskList is now derived from TimeIntegratorTaskList and calls many of its parent class's TaskFunc instead of duplicating their implementations.

But I never did that! No idea if that was intentional/forgot to update the description, or if it was an oversight. The closest change to this I can find was passing a TimeIntegratorTaskList *ptlist to the ctor in
b5682c8 which doesnt seem used at all!

  • Should we just change the inheritance to class SuperTimeStepTaskList : public TimeIntegratorTaskList and initialize the stage_wghts[]? And remove that ptlist pointer from the class.

Despite the unsafe change, it worked fine in practice because all of the task functions that STS shared with the main time integrator class operated only on their explicit function parameters, the MeshBlock pointer and the current int stage (and also the int ntasks, nstages in the shared parent class TaskList). It still is certainly undefined behavior according to the standard, but someone with more C++ compiler knowledge would need to explain how the method function pointer resolution works in the object files.

The addition of orbital advection in #321 made it unsafe in practice because it extended struct IntegratorWeight with new coefficients and Boolean flags, and those flags were suddenly referenced in the "shared" task functions e.g.

if (stage_wghts[stage-1].main_stage) {

But even that has "been fine" for almost 4 years, since those conditionals likely worked correctly as the uninitialized values referenced by STS were never true!

@felker
Copy link
Contributor

felker commented Nov 5, 2024

retest this please

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants